Predicting Bank Deposit Subscriptions Using a Decision Tree Classifier¶

Goal:
The objective of this project is to build a decision tree classifier that accurately predicts whether a client will subscribe to a bank deposit based on their demographic and behavioral data. We aim to identify the key factors influencing deposit subscriptions through our analysis, then use that information to build our predictive model and provide actionable insights to improve marketing strategies.

About the Dataset:
The dataset was provided by Prodigy InfoTech from the UCI Machine Learning Repository. It contains information about clients and their interactions with the bank's marketing efforts, mostly demographic data (age, job, marital status, education level, etc.) and behavioral data (past campaign success, contact method, etc.). It is a well-known dataset for predicting the success of bank marketing campaigns.
Link to the dataset : Bank

Importing Libraries¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

from plotly.subplots import make_subplots
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Importing Dataset¶

In [2]:
df = pd.read_csv('C:/Users/obalabi adepoju/Downloads/bank.csv')

Data Cleaning & Inspection¶

We'll look at a general overview of our data and a description of each column.

In [3]:
df.head(10)
Out[3]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
5 35 management married tertiary no 231 yes no unknown 5 may 139 1 -1 0 unknown no
6 28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no
7 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
8 58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
9 43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
In [5]:
print(f"This dataset contains {df.shape[0]} rows and {df.shape[1]} columns")
This dataset contains 45211 rows and 17 columns

Here's a description of each column in the Dataset:

Feature Variables:

  1. age: Age of the client. The client's age in years.

  2. job: Type of job. The occupation of the client (e.g., "admin.", "blue-collar", "entrepreneur").

  3. marital: Marital status. The client's marital status (e.g., "married", "single", "divorced").

  4. education: Education level. The highest education level attained by the client (e.g., "primary", "secondary", "tertiary").

  5. default: Credit in default. Whether the client has credit in default ("yes", "no").

  • Credit in default refers to a situation where a borrower has failed to meet their debt repayment obligations according to the terms of their credit agreement. Specifically, it means that the borrower has missed payments or has been unable to repay the borrowed amount as scheduled.
  6. balance: Average yearly balance in euros. The average balance of the client's account over the past year.

  7. housing: Housing loan. Whether the client has a housing loan ("yes", "no").

  8. loan: Personal loan. Whether the client has a personal loan ("yes", "no").

  9. contact: Communication type. The type of communication used for the last contact ("telephone", "cellular").

  10. day: Last contact day of the month. The day of the month when the last contact was made during the current marketing campaign.

  11. month: Last contact month of the year. The month when the last contact was made (e.g., "jan", "feb", "mar").

  12. duration: Last contact duration in seconds. The duration of the last contact in seconds.

  13. campaign: Number of contacts performed during this campaign. The number of times the client was contacted during the current campaign.

  14. pdays: Number of days since the client was last contacted in a previous campaign; '-1' indicates the client was not previously contacted.

  15. previous: Number of contacts performed before this campaign. The number of contacts the client received before the current campaign.

  16. poutcome: Outcome of the previous marketing campaign. The result of the previous marketing campaign ("unknown", "other", "failure", "success").

Target Variable:

  1. y: Subscription to term deposit. Indicates whether the client subscribed to a term deposit ("yes", "no").

We'll go through and clean all the important factors that contribute to or affect our target variable 'y', starting with the age column.

Note: 'unknown' is the placeholder for null values in this dataset, so wherever we see an 'unknown' value, it means the value is missing.
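Since 'unknown' serves as the null marker, a quick way to gauge missingness is to count 'unknown' entries across the categorical columns. A minimal sketch on a toy frame (the values here are made up, not from the actual dataset):

```python
import pandas as pd

# Toy frame standing in for the bank data (hypothetical values)
toy = pd.DataFrame({
    'job': ['management', 'unknown', 'technician'],
    'contact': ['unknown', 'cellular', 'unknown'],
    'age': [35, 42, 29],
})

# Count 'unknown' entries in each object (string) column
unknown_counts = (toy.select_dtypes(include='object') == 'unknown').sum()
print(unknown_counts)
```

Running the same expression on `df` gives the per-column 'unknown' counts we check one column at a time below.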

In [6]:
fig = px.violin(df,x='age',title='Age Distribution',color_discrete_sequence = ['dodgerblue'])
fig.show()
  • Our plot shows the range of our data, from 18 to 95 years. It shows where the bulk of ages sits, i.e. most customers are between 30 and 50 (more specifically, 33 and 48), with half of our customers aged 39 or below. It also draws our attention to a few outliers that do not conform to the general cluster of our data, namely clients above the age of 70.
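Those elderly outliers can be flagged with the same IQR fences we apply later to balance and duration. A hedged sketch on a toy age sample (not the real data):

```python
import pandas as pd

# Toy sample of ages, chosen to mimic the shape described above
ages = pd.Series([18, 30, 33, 39, 48, 50, 72, 95])

# Tukey's fences: anything above Q3 + 1.5*IQR counts as an outlier
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = ages[ages > upper_fence]
print(outliers.tolist())
```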

Job¶

In [7]:
#checking for null values
df[['job']][df.job == 'unknown'].count()
Out[7]:
job    288
dtype: int64
In [8]:
# The null values are few, so we'll remove them
df = df[df.job != 'unknown']
In [9]:
df['job'].value_counts()
Out[9]:
job
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
Name: count, dtype: int64

Blue-collar jobs represent the largest group, indicating a strong presence of manual labor or skilled trades among the customers in our data. Management comes in a close second, showing a significant portion of the population in leadership and decision-making roles. The middle tier ranges from technician to services roles, while the smaller groups consist of retirees, the self-employed, the unemployed, students, and domestic workers.

Marital¶

In [10]:
# Next we want to look at our marital column
fig = px.pie(df,'marital',title = 'Marriage Distribution', color='marital',hole=0.5)
fig.show()
It's unsurprising that the majority of customers are married, with single individuals making up a smaller portion, and divorcees representing the smallest group.

Education¶

In [11]:
# we check for null values
df[df.education == 'unknown']
Out[11]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
13 58 technician married unknown no 71 yes no unknown 5 may 71 1 -1 0 unknown no
16 45 admin. single unknown no 13 yes no unknown 5 may 98 1 -1 0 unknown no
42 60 blue-collar married unknown no 104 yes no unknown 5 may 22 1 -1 0 unknown no
44 58 retired married unknown no 96 yes no unknown 5 may 616 1 -1 0 unknown no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45098 44 technician single unknown no 11115 no no cellular 25 oct 189 1 185 4 success no
45109 78 management married unknown no 1780 yes no cellular 25 oct 211 2 185 7 success yes
45129 46 technician married unknown no 3308 no no cellular 27 oct 171 1 91 2 success yes
45150 65 management married unknown no 2352 no no cellular 8 nov 354 3 188 13 success no
45158 34 student single unknown no 2321 no no cellular 9 nov 600 2 99 5 failure no

1730 rows × 17 columns

In [12]:
# Since the number of unknown values is small relative to our data, we'll remove it
df = df[df.education != 'unknown']
In [13]:
fig = px.histogram(df,'education',title = 'Education Distribution', color_discrete_sequence=['mediumseagreen'])
fig.show()

Default Credit¶

In [14]:
# Next we'll be looking at our default credit column to understand its distribution
df['default'].value_counts()
Out[14]:
default
no     42411
yes      782
Name: count, dtype: int64

Impressively, only about 1.8% of clients have credit in default, indicating strong financial responsibility among the bank's customers.

Average Balance¶

In [15]:
# next we want to check the distribution of the average balance of customers for the year
fig = px.histogram(df,'balance',title = 'Average Balance Distribution', color_discrete_sequence=['mediumseagreen'])
fig.show()

Our histogram reveals that the most prominent balance range among customers is between 0 and 99, with the number of individuals decreasing as balances increase. Notably, the count of people with higher balances drops significantly from around 4,000 onward, becoming sparse beyond that point. Interestingly, a small segment of the population, accounting for 8.3% of our data, holds negative balances, with some reaching as low as -7,000. This highlights a minority of customers who are in debt.
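The share of clients with negative balances can be computed directly as the fraction of rows where balance is below zero. A small sketch on toy values (illustrative only, not the real 8.3% figure):

```python
import pandas as pd

# Toy balance values loosely modeled on the rows shown in df.head()
balances = pd.Series([2143, 29, -500, 2, 1506, -7000, 231, 447, 121, 593])

# Percentage of clients whose average balance is negative (in debt)
neg_share = (balances < 0).mean() * 100
print(f"{neg_share:.1f}% of clients hold negative balances")
```

The same expression, `(df['balance'] < 0).mean() * 100`, yields the share quoted above for the full dataset.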

Let's see what this would look like after filtering out outliers.

In [16]:
Q1 = df['balance'].quantile(0.25)
Q3 = df.balance.quantile(0.75)

IQR = Q3-Q1
upper_fence = Q3 + 1.5 * IQR
lower_fence = Q1 - 1.5 * IQR

d = df[(df.balance > lower_fence) & (df.balance < upper_fence)]
fig = px.histogram(d,'balance',title = 'Average Balance Distribution', color_discrete_sequence=['mediumseagreen'])
fig.show()

Contact¶

In [17]:
#Let's check for null values
df[['contact']][df.contact == 'unknown'].count()
Out[17]:
contact    12286
dtype: int64

Woah! Those are a whole lot of unknowns, and due to their sheer number we can't simply remove them, so we'll replace them with the most frequently occurring category.

In [18]:
df['contact'] = df['contact'].replace('unknown', df['contact'].mode()[0])
In [19]:
# Let's check out its distribution
fig = px.pie(df,'contact',title = 'Contact Distribution', color='contact',hole=0.5,
             color_discrete_map = {'cellular':'mediumpurple','telephone':'lavender'})
fig.show()

Month¶

In [20]:
#Let's check for null values
df[['month']][df.month == 'unknown'].count()
Out[20]:
month    0
dtype: int64
In [21]:
fig = px.histogram(df,'month',title = 'Month Distribution', color_discrete_sequence=['#EF553B'])
fig.show()
We see that most of our records are in May, with July and August close behind, followed by June. The quietest month, perhaps due to the holiday season, is December.

Duration¶

In [22]:
# Now we want to focus on the duration of the contact call
fig = px.histogram(df,'duration',title = 'Duration Distribution', color_discrete_sequence=['mediumpurple'])
fig.show()
The highest concentration of call durations falls between 80 and 130 seconds, and the count drops drastically as calls get longer.

Let's see what this would look like without the extreme values.

In [23]:
Q1 = df['duration'].quantile(0.25)
Q3 = df.duration.quantile(0.75)

IQR = Q3-Q1
upper_fence = Q3 + 1.5 * IQR

d = df[df.duration < upper_fence]
fig = px.histogram(d,'duration',title = 'Duration Distribution', color_discrete_sequence=['mediumpurple'])
fig.show()

Campaign¶

In [24]:
fig = px.histogram(df,'campaign',title = 'Campaign Distribution', color_discrete_sequence=['royalblue'])
fig.show()

We see most customers were contacted once or twice during the entire campaign, and the number of people contacted more often quickly thins out as the contact count increases. We also see several odd values showing that quite a few people were contacted more than 20 times.

Exploratory Data Analysis¶

For this next phase of the project, we'll take a different approach: dividing our data into those with the target "yes" and those with "no", and focusing our magnifying glasses on the differences between these two populations. This will give us insight into the main factors surrounding the outcome and guide our feature selection process. We want to understand the special characteristics of those who made a deposit compared to those who didn't.

In [25]:
dfyes = df[df['y'] == 'yes']
dfno = df[df['y'] == 'no']
In [26]:
print((len(dfno)/len(df)) * 100)
88.3754312041303

We see that our "no" population closely mirrors the overall data, occupying 88% of it.
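That 88% figure can also be read straight off `value_counts(normalize=True)`. A toy sketch of the same calculation:

```python
import pandas as pd

# Toy target column with the same 88/12 split as our data
y = pd.Series(['no'] * 88 + ['yes'] * 12)

# Class shares as percentages
shares = y.value_counts(normalize=True) * 100
print(shares)
```

On the real data, `df['y'].value_counts(normalize=True)` gives the same breakdown in one line.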

Creating Our Functions¶

In [27]:
def viz(title,column,c1,c2):
    # Create subplots: 1 row, 2 columns
    fig = make_subplots(rows=1, cols=2, subplot_titles=("Population (Yes)", "Population (No)"))
    
    if df[column].dtype == 'int64':
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)

        IQR = Q3-Q1
        upper_fence = Q3 + 1.5 * IQR
        lower_fence = Q1 - 1.5 * IQR
        
        dfn = dfno[(dfno[column] > lower_fence) & (dfno[column] < upper_fence)]
        dfy = dfyes[(dfyes[column] > lower_fence) & (dfyes[column] < upper_fence)]
    else:
        dfn = dfno
        dfy = dfyes

    # First histogram for 'Yes'
    fig.add_trace(
        go.Histogram(x=dfy[column], name='Yes', marker_color=c1,marker_line_color='black', marker_line_width=1),
        row=1, col=1
    )
    
    # Second histogram for 'No'
    fig.add_trace(
        go.Histogram(x=dfn[column], name='No', marker_color=c2,marker_line_color='black',marker_line_width=1),
        row=1, col=2
    )

    # Update layout
    fig.update_layout(title_text= title + " Distribution by Deposit Outcome", showlegend=False)

    # Show plot
    fig.show()
    
def vizv(title,column,c1,c2):
    # Create subplots: 1 row, 2 columns
    fig = make_subplots(rows=1, cols=2, subplot_titles=("Population (Yes)", "Population (No)"))
    
    if df[column].dtype == 'int64':
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)

        IQR = Q3-Q1
        upper_fence = Q3 + 1.5 * IQR
        lower_fence = Q1 - 1.5 * IQR
        
        dfn = dfno[(dfno[column] > lower_fence) & (dfno[column] < upper_fence)]
        dfy = dfyes[(dfyes[column] > lower_fence) & (dfyes[column] < upper_fence)]
    else:
        dfn = dfno
        dfy = dfyes

    # First violin for 'Yes'
    fig.add_trace(
        go.Violin(x=dfy[column], name='Yes', marker_color=c1),
        row=1, col=1
    )

    # Second violin for 'No'
    fig.add_trace(
        go.Violin(x=dfn[column], name='No', marker_color=c2),
        row=1, col=2
    )

    # Update layout
    fig.update_layout(title_text= title + " Distribution by Deposit Outcome", showlegend=False)

    # Show plot
    fig.show()  

Education¶

We'll be starting with our Education column.

In [28]:
vizv('Education','education','mediumseagreen','mediumpurple')

Through our visualizations, we're aiming to spot out differences between our "no" population and our "yes" population. With that, let's get into it.

  • The most prominent difference between these two populations is that while clients with tertiary education are far fewer than those with secondary education in our "no" population, the gap is much smaller among those who said yes. This difference meets the bar we have set for determining which features will be useful in our model.

Contact¶

Just in case you question the importance of this feature to our target, I'd like to highlight that preference for a specific method of communication influences how receptive a person is to what you're trying to get across.
For example: people typically prefer a call or a voice note for lengthy information, while others (myself included) prefer a concise text or written message regardless of its length. This is just to say people have different preferences, and that may influence their perception.

In [29]:
viz('Contact','contact','royalblue','mediumpurple')
  • We see the generally preferred method is cellular regardless of which segment they belong to.

Month¶

In [30]:
viz('Monthly','month','mediumseagreen','royalblue')
  • May is the highest overall, as we know most clients were contacted in May, but beyond that there are obvious differences between these two charts, most notably August being the second highest and April occupying a significantly larger share than its counterpart in the "no" populace.

Campaign¶

In [31]:
viz('Campaign','campaign','#EF553B','mediumpurple')
  • It wouldn't have been far-fetched to say that clients contacted only once or twice during the campaign would have higher chances of a yes, but we see this is also the case in our "no" population. There doesn't seem to be any notable difference in the distribution for this column, so we won't consider it in our model.

Job¶

In [32]:
viz('Occupation','job','royalblue','#EF553B')
  • First and foremost, let’s point out that this is a clear indication that our general data mostly mirrors our 'no' population, as the distributions are intrinsically similar. However, our 'yes' responses reveal something different. Most notably, the prevalence of the management occupation is followed by technicians and then blue-collar workers. Based on this information, it might be safe to say that fewer blue-collar workers said 'yes' compared to their general numbers. A likely reason for the high number of 'yes' responses among blue-collar workers could be their large population, but let’s not delve into that. We see a disparity between the two distributions, so we’ll be considering this column.

Balance¶

In [33]:
viz('Average Balance','balance','deepskyblue','mediumpurple')
  • In this column, the only notable difference seems to be the color of the plot, although our "no" population contains more negative values, which suggests slightly lower chances of a "yes" from clients with negative balances.

Housing (Mortgage)¶

In [34]:
vizv('Mortgage','housing','blue','mediumseagreen')
  • Perhaps the most notable difference we've seen so far: a significant portion of our "yes" population has no mortgage, while our "no" population tells a contrasting story. We know this because the plots are nearly inverted relative to each other. This feature will be very useful in building our model.
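One way to quantify this kind of difference is the subscription rate within each housing group, e.g. via `pd.crosstab` with `normalize='index'`. A sketch on made-up values (not the actual rates):

```python
import pandas as pd

# Toy stand-in for the housing/target columns (hypothetical values)
toy = pd.DataFrame({
    'housing': ['yes', 'yes', 'no', 'no', 'yes', 'no'],
    'y':       ['no',  'no',  'yes', 'no', 'no',  'yes'],
})

# Row-normalized crosstab: subscription rate within each housing group, in %
rates = pd.crosstab(toy['housing'], toy['y'], normalize='index') * 100
print(rates)
```

The same call on `df` would put an exact number on how much more likely mortgage-free clients are to subscribe.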

Loan¶

In [35]:
viz("Loan",'loan','#EF553B','#EF553B')
  • The difference here is in the population of those who have personal loans and it isn't very significant but coupled with our last insight, it tells us that a significant number of our yes population are those without loans or mortgages.

Day¶

In [36]:
viz("Daily",'day','blue','mediumseagreen')
  • Here we clearly see that both distributions have different shapes and perhaps the most intriguing difference is the higher count on the 30th for those who said "yes" which suggests a significant surge in positive responses on that specific day. This could indicate higher chances of a customer saying yes towards the end of the month.

Duration¶

In [37]:
viz('Duration','duration','royalblue','royalblue')
  • Our distributions are nearly identical, with the only difference being that very short calls are much rarer among those who said yes than among our "no" population, indicating that brief interactions were less effective at converting clients.

Insights¶

  1. Education Level Discrepancy: There is a notable difference in education levels between the 'yes' and 'no' populations. The gap between tertiary and secondary education is large among 'no' responders but much smaller among 'yes' responders, indicating education level is a useful feature for modeling.

  2. Preferred Contact Method: Cellular contact is the preferred method across both segments, showing no significant difference between 'yes' and 'no' populations.

  3. Monthly Contact Patterns: May has the highest number of contacts, with August and April showing notable differences between 'yes' and 'no' populations. This indicates seasonality in contact effectiveness.

  4. Occupation Influence: Management and technician occupations are more prevalent among 'yes' responders, while blue-collar workers, despite their large numbers, are less likely to say 'yes' compared to their general population.

  5. Mortgage and Loan Status: A significant portion of 'yes' responders do not have a mortgage, contrasting with the 'no' responders. This suggests mortgage status is an important feature for predicting positive outcomes.

  6. Effectiveness of Call Duration: Shorter call durations are less effective in converting individuals to 'yes,' suggesting longer interactions may be more successful in achieving positive responses.

Feature Selection¶

We'll start off this next step by encoding our variables.

In [38]:
# One-hot encode 'job', 'month', and 'contact'
df_encoded = pd.get_dummies(df, columns=['job','contact','month'], drop_first=True)

# Apply label encoding to 'education'
label_encoder = LabelEncoder()
df_encoded['education'] = label_encoder.fit_transform(df['education'])

# Transforming our yes/no columns to 1's and 0's
df_encoded['housing'] = df_encoded['housing'].map({'yes': 1, 'no': 0})
df_encoded['loan'] = df_encoded['loan'].map({'yes': 1, 'no': 0})
df_encoded['y'] = df_encoded['y'].map({'yes': 1, 'no': 0})
df_encoded['default'] = df_encoded['default'].map({'yes': 1, 'no': 0})
In [39]:
# Dropping columns we won't be making use of in our model
df_encoded.drop(columns=['marital','poutcome','previous','pdays','age','campaign'],inplace=True)

# Shuffle the DataFrame
data = df_encoded.sample(frac=1, random_state=42).reset_index(drop=True)
In [40]:
# An example of what our data looks like now
data
Out[40]:
education default balance housing loan day duration y job_blue-collar job_entrepreneur ... month_dec month_feb month_jan month_jul month_jun month_mar month_may month_nov month_oct month_sep
0 1 0 3417 0 0 12 134 1 False False ... False False False False False False False True False False
1 2 0 5506 0 1 6 141 0 False True ... False False False False True False False False False False
2 0 0 556 1 0 14 227 0 True False ... False False False False False False True False False False
3 0 0 1406 1 0 14 252 0 True False ... False False False False False False True False False False
4 1 0 397 1 0 23 252 0 False False ... False False False False False False True False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
43188 1 0 181 1 0 28 230 0 False False ... False False False False False False True False False False
43189 1 0 1483 0 0 20 32 0 False False ... False False False False True False False False False False
43190 1 0 2087 0 0 1 111 0 False False ... False False False False True False False False False False
43191 1 0 528 0 0 7 274 0 False False ... False False False False False False True False False False
43192 1 0 204 1 0 24 38 0 False False ... False False False True False False False False False False

43193 rows × 30 columns

In [41]:
#Define features (X) and target (y)
X = data.drop('y', axis=1) 
y = data['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

Handling Imbalance¶

As we previously observed, "no" responses take up 88% of our entire dataset, and while data modeling is still pretty cool, it's not magic. The dataset is imbalanced, and the population of those who said "yes" stands firmly in the minority class. Luckily, we have a way to take care of this. Permit me to ramble a bit here. We are going to balance our training data by oversampling the minority class with SMOTE (Synthetic Minority Over-sampling Technique), generating synthetic samples to ensure our model has enough yes's to discover a pattern.
Rest assured, we're not just duplicating data points; SMOTE takes each instance in the minority class along with its nearest neighbors (which are also in our data) and creates a synthetic sample between the two points, ensuring diversity while keeping the generated data within the feature space.

In [42]:
from imblearn.over_sampling import SMOTE
# Initialize SMOTE
smote = SMOTE(random_state=42)

# Fit and transform the training data
X_train_, y_train_ = smote.fit_resample(X_train, y_train)

Model Training & Evaluation¶

In [43]:
# Initialize our model 
clf = DecisionTreeClassifier( random_state=42)

# Train the model
clf.fit(X_train_, y_train_)
Out[43]:
DecisionTreeClassifier(random_state=42)
In [44]:
# Make predictions using our model
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Generate classification report
report = classification_report(y_test, y_pred)

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)

plt.figure(figsize=(7, 5))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
Accuracy: 0.86
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.91      0.92     15281
           1       0.40      0.47      0.43      1997

    accuracy                           0.86     17278
   macro avg       0.66      0.69      0.67     17278
weighted avg       0.87      0.86      0.86     17278

Conclusion¶

  • The model achieved a strong accuracy of 86%, performing notably well at predicting "No" outcomes, as indicated by a precision of 0.93. However, its performance at predicting "Yes" outcomes, with a precision of 0.40 and recall of 0.47, leaves room for improvement. While the model remains useful, especially for identifying "No" cases, its effectiveness at predicting "Yes" outcomes could benefit from additional data collection, especially for clients who said yes. The model is limited in what it can achieve, but we argue it could still prove useful, and it will only get better with more data to work with.
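Beyond collecting more data, one hedged direction worth trying is to let the tree itself compensate for the imbalance via `class_weight='balanced'` and to prune it with `max_depth` to curb overfitting. The sketch below runs on synthetic data from `make_classification` with a similar 88/12 split; it illustrates the mechanism only and makes no claim about performance on the bank dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Synthetic stand-in for the bank data: roughly 88/12 class split
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.88], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4,
                                          stratify=y, random_state=42)

# class_weight='balanced' penalizes minority-class errors more heavily,
# and max_depth prunes the tree to reduce overfitting
clf = DecisionTreeClassifier(class_weight='balanced', max_depth=6,
                             random_state=42)
clf.fit(X_tr, y_tr)

# Recall on the minority ("yes") class is the metric we most want to lift
minority_recall = recall_score(y_te, clf.predict(X_te))
print(f"Minority-class recall: {minority_recall:.2f}")
```

Whether this beats the SMOTE pipeline above would need to be checked on the actual data, ideally with cross-validation.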